
Conversation

@SwordFaith
Contributor

@SwordFaith SwordFaith commented Dec 31, 2025

Thank you for the efforts on radixark/miles#246 and the hard work contributed by @Ratish1!

@SwordFaith SwordFaith marked this pull request as draft December 31, 2025 04:34
@SwordFaith SwordFaith force-pushed the swordfaith/feat/add_large_sft_data_support branch from 24fb033 to 7d855c6 on January 3, 2026 13:55
@ChangyiYang
Contributor

@codex review

@chatgpt-codex-connector

To use Codex here, create a Codex account and connect it to GitHub.

@ChangyiYang
Contributor

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8ee19cc3fa

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 221 to 223
# Store original content length before padding for correct unpacking
packed_sequence["content_length"] = seq_length


P2: Preserve original content_length across CP padding

In CP mode, _get_model_inputs_args pads the same packed batch multiple times (e.g., once during _compute_log_prob and again during _train_step). pad_packed_sequence_with_cp overwrites content_length unconditionally, so on the second call it records the already padded length. unpack_sequences then computes pad_length as 0 and slices log_probs/entropy for the last sequence using the padded end index, which drops real tokens and includes padding when cp_size > 1, corrupting the loss/metrics for that sequence. Consider only setting content_length if it’s absent or skipping padding when the batch is already padded.
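A minimal sketch of the first option (setting content_length only when it is absent), assuming the dict-style packed batch shown in the quoted snippet; the helper name here is hypothetical, not the PR's actual code:

```python
# Hypothetical helper illustrating the suggested guard; only the dict layout
# is taken from the snippet quoted above.
def record_content_length(packed_sequence: dict, seq_length: int) -> None:
    # Remember the original (unpadded) length only on the first padding pass,
    # so a second pass in CP mode cannot overwrite it with the padded length.
    packed_sequence.setdefault("content_length", seq_length)
```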


Collaborator

@PopSoda2002 PopSoda2002 left a comment


Thanks for your great work! I'm not sure whether we should separate the datasets and SFT changes into different PRs; this PR currently seems too large to review.

if self.args.calculate_per_token_loss:
    sft_loss = sum_of_token(sft_loss, response_lengths, loss_masks)
else:
    sft_loss = sum_of_sample_mean(sft_loss, response_lengths, loss_masks)
Copy link
Collaborator


I think we just need calculate_per_sample_loss in SFT?

Contributor


After further discussion, we realized that SFT should only use per-token loss.
We'll simplify this logic by keeping only the token loss here.
For users who still try to use sequence / per-sample loss in SFT, we'll explicitly raise an error to avoid silent misconfiguration.
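A minimal sketch of that guard, for illustration only: the function name, tensor shapes, and normalization below are assumptions, not the PR's actual sum_of_token helper.

```python
import torch

def sft_token_loss(per_token_loss: torch.Tensor,
                   loss_mask: torch.Tensor,
                   calculate_per_token_loss: bool) -> torch.Tensor:
    # SFT only supports per-token loss; fail loudly instead of silently
    # falling back to a per-sample mean.
    if not calculate_per_token_loss:
        raise ValueError(
            "SFT only supports per-token loss; "
            "sequence / per-sample loss is not a valid configuration."
        )
    # Masked sum over supervised tokens, normalized by the token count.
    return (per_token_loss * loss_mask).sum() / loss_mask.sum().clamp(min=1)
```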

seed=self.args.rollout_seed,
apply_chat_template=self.args.apply_chat_template,
apply_chat_template_kwargs=self.args.apply_chat_template_kwargs,
dp_size=self._dp_size or 1,
Copy link
Contributor


We should not pass the dp size to the data source, because the data source is used in the rollout manager, which does not have dp ranks.

Copy link
Contributor Author


We should not pass the dp size to the data source, because the data source is used in the rollout manager, which does not have dp ranks.

Apologies for the mistake, I’ll address it in the upcoming commits.
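A hypothetical before/after sketch of the construction, with DataSource as a stand-in class (not the PR's actual data source) and illustrative values; the point is simply that dp_size is dropped from the kwargs passed by the rollout manager.

```python
from dataclasses import dataclass, field

# Stand-in for the PR's data source class; fields mirror the quoted kwargs.
@dataclass
class DataSource:
    seed: int
    apply_chat_template: bool
    apply_chat_template_kwargs: dict = field(default_factory=dict)
    # No dp_size field: the rollout manager that builds this object has no
    # data-parallel rank, so any dp-rank-based sharding belongs elsewhere.

source = DataSource(
    seed=1234,
    apply_chat_template=True,
    apply_chat_template_kwargs={"add_generation_prompt": False},
)
```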

@SwordFaith
Copy link
Contributor Author

Thanks for your great work! I'm not sure whether we should separate the datasets and SFT changes into different PRs; this PR currently seems too large to review.

I'll work on some simplifications with @ChangyiYang today. Afterward, could you review it again so we can discuss whether to split this PR into two?

@SwordFaith SwordFaith marked this pull request as ready for review January 6, 2026 05:50
@SwordFaith SwordFaith changed the title from [WIP][data][feat] add large dataset support to [data][feat] add large dataset support on Jan 6, 2026
@PopSoda2002
Copy link
Collaborator

Thanks for your great work! I'm not sure whether we should separate the datasets and SFT changes into different PRs; this PR currently seems too large to review.

I'll work on some simplifications with @ChangyiYang today. Afterward, could you review it again so we can discuss whether to split this PR into two?

Yeah sure, definitely, always willing to help.
